Replication of Checkpoints in Recoverable DSM Systems
Abstract
This paper presents a new recovery technique for object-based Distributed Shared Memory (DSM) systems. The technique, integrated with a coherence protocol for the atomic consistency model, offers high availability of shared objects despite multiple node and communication failures, while introducing little overhead. It ensures fast recovery after multiple node failures and enables a DSM system to cope with network partitioning, as long as a majority partition can be formed.
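The majority-partition condition mentioned in the abstract can be illustrated with a minimal sketch. The function names and the representation of partitions as sets of node identifiers are assumptions made for illustration; they are not taken from the paper, and the paper's actual protocol is not reproduced here.

```python
from typing import List, Optional, Set


def has_majority(partition_size: int, total_nodes: int) -> bool:
    """Return True if a partition holds a strict majority of all nodes."""
    if total_nodes <= 0 or not 0 <= partition_size <= total_nodes:
        raise ValueError("invalid node counts")
    # Strict majority: more than half of the nodes in the whole system.
    return 2 * partition_size > total_nodes


def majority_partition(partitions: List[Set[str]], total_nodes: int) -> Optional[Set[str]]:
    """Return the partition that holds a majority, or None if no partition does.

    Hypothetical helper: at most one disjoint partition can hold a strict majority.
    """
    for nodes in partitions:
        if has_majority(len(nodes), total_nodes):
            return nodes
    return None
```

With five nodes split into groups of three and two, only the group of three satisfies the condition; with an even split of four nodes into two and two, neither group does.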
Similar resources
Replication for Efficiency and Fault Tolerance in a DSM System
Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechani...
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
Large-scale distributed systems are very attractive for executing parallel applications that require huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM). Although most recover...
ICARE: Combining Efficiency and High-Availability in a DSM System
In light of the increasing throughput of local area networks, Networks of Workstations (NOW) that provide a distributed shared memory (DSM) have become a convenient alternative to parallel architectures for parallel scientific applications. ICARE is a recoverable DSM based on backward error recovery, implemented on top of an experimental ATM platform running the CHORUS m...
An Extended Coherence Protocol for Recoverable DSM Systems with Causal Consistency
This paper presents a coherence protocol for recoverable Distributed Shared Memory (DSM) systems with causally consistent read-write objects. It uses independent checkpointing tightly integrated with coherence operations. That integration results in high availability of shared objects and ensures fast restoration of the consistent state of DSM in spite of multiple node failures, introducing lit...
Recoverable Distributed Shared Memory Using the Competitive Update Protocol
In this paper, we propose a recoverable DSM that uses a competitive update protocol. In this update protocol, multiple copies of each page may be maintained at different nodes. However, it is also possible for a page to exist in only one node, as some copies of the page may be invalidated. We propose an implementation that makes the competitive update protocol recoverable from a single node failu...